SUMMATION TEST FOR GAP PENALTIES AND STRONG LAW
OF THE LOCAL ALIGNMENT SCORE

By Hock Peng Chan 

National University of Singapore 

A summation test is proposed to determine admissible types of 
gap penalties for logarithmic growth of the local alignment score. 
We also define a converging sequence of log moment generating func- 
tions that provide the constants associated with the large deviation 
rate and logarithmic strong law of the local alignment score and the 
asymptotic number of matches in the optimal local alignment. 

1. Introduction. In protein and DNA sequence matching, two sequences
of length $m$ and $n$ are aligned to determine whether each contains a segment
that is significantly matched to a segment of the other. A local alignment
score is assigned according to the quality of the matches in the alignment,
less penalties for gaps present within the alignment. The gap penalty is of
the form $\Delta + \gamma(k)$ [with $\gamma(1) = 0$], for a gap of length $k$.
The choice of $\Delta$, also known as the gap initialization penalty, reflects
our belief in the frequency of segment insertion/deletion in the evolutionary
process, while the choice of $\gamma$ reflects our belief in the distribution
of the length of the segment that is inserted into or deleted from DNA or
protein sequences.

The affine gap penalty function corresponds to $\gamma(k) = \delta(k-1)$ for
some $\delta > 0$ and is currently the most popular in sequence alignment
programs (cf. BLAST; [2]). Part of its popularity can be attributed to the
recursive Smith-Waterman algorithm (cf. [17]) that allows the local alignment
score to be computed in $O(mn)$ time (cf. [11]). Much research has been done
to understand the asymptotic behavior of the local alignment score for affine
gap penalties. In [3], it was shown that the gap penalties can be essentially
divided into two types, according to whether the local alignment score grows
at a logarithmic rate or a linear rate. A logarithmic rate of growth is
statistically desirable and, hence, the condition provided in that paper for
determining logarithmic growth is useful in practice.

Under the Hidden Markov Model (HMM) theory (cf. [10]), the local alignment
score for affine gap penalties is equivalent to the maximum likelihood
score under the assumption that the length of segments inserted or
deleted is geometrically distributed. This does not agree with extensive
empirical studies (cf. [4, 12]), which show that a heavier-tailed distribution
is more likely and suggest the use of long-range gap penalties that satisfy
$\gamma(k) = o(k)$. Common gap penalties that have been considered are the power
law [$\gamma(k) = \delta(k-1)^{\alpha}$ for some $0 < \alpha < 1$] and the
logarithmic [$\gamma(k) = \delta \log k$] gap penalties. Algorithms using
$O(mn)$ time for computing the local alignment score are available (cf. [14, 19]
for global alignments and [15] for local alignments). However, there has so far
been little understanding of the asymptotics of the local alignment score for
these gap penalties, and questions about the appropriateness of these scores
for statistical analysis remain.
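
The three families of gap penalties named above can be written down directly. The following Python sketch is illustrative only: the function names are hypothetical, `Delta` stands for the gap initialization penalty, `delta` and `alpha` are the shape parameters, and the convention that an absent gap (length 0) costs nothing is taken from the definition of g in Section 2.

```python
import math

def affine_gamma(k, delta):
    """Affine shape: gamma(k) = delta * (k - 1)."""
    return delta * (k - 1)

def power_law_gamma(k, delta, alpha):
    """Power-law shape: gamma(k) = delta * (k - 1) ** alpha, with 0 < alpha < 1."""
    return delta * (k - 1) ** alpha

def log_gamma(k, delta):
    """Logarithmic shape: gamma(k) = delta * log(k)."""
    return delta * math.log(k)

def gap_penalty(k, Delta, gamma):
    """Total penalty g(k) = Delta + gamma(k) for a gap of length k >= 1.

    Assumption for this sketch: g(0) = 0, i.e. no gap means no penalty.
    """
    if k == 0:
        return 0.0
    return Delta + gamma(k)
```

All three shapes satisfy gamma(1) = 0, so a gap of length one costs exactly the initialization penalty. Only the affine shape grows linearly in k; the other two are long-range in the sense that gamma(k)/k tends to zero.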

Over the past decade, there have been many advances in the use of alignment
algorithms for the prediction of RNA secondary structure from primary
sequences; see, for example, [5] and [18] for an overview of the underlying
issues. The interaction energies of base pairings are used to determine the
scores of similarity matrices, while unaligned regions are associated with
loops, which require a logarithmic "loop energy" for their formation, and
this supports the use of the logarithmic gap penalty function. The superior
performance of the logarithmic and power law gap penalty functions in deriving
biologically meaningful optimal alignments was confirmed in a detailed
study by Dewey [9]. In the alignment of weakly related proteins, it was also
observed that long intervening loops are relatively nonconserved and best
left unaligned (cf. [1]). Long-range penalty functions, which encourage long
gaps, are suitable for this purpose.

In Section 2, we provide a simple summation test for $\gamma$ that can
determine if logarithmic growth of the local alignment score is possible.
Section 3 extends these results into a strong law under the additional
assumption that $\lim_{k\to\infty} \gamma(k)/\log k = \infty$. In Section 4,
the asymptotic number of matches in the optimal alignment is also shown to
obey a strong law.

2. A summation test for gap penalties. Let $\mathcal{A}$ be a finite alphabet which
can be used to represent either the twenty amino acids in protein sequences
or the four nucleotide bases in DNA sequences. Let $K : \mathcal{A} \times \mathcal{A} \to \mathbb{R}$ be a
similarity score matrix satisfying $K(a,b) = K(b,a)$ for all $a, b \in \mathcal{A}$ and let
$g : \{0, 1, \ldots\} \to [0, \infty)$ with $g(0) = 0$ be a nondecreasing, concave
[i.e., $g(k+1) - g(k) \le g(k) - g(k-1)$] function. Let $\mathcal{Z}$ be the class of all nonempty
candidate alignments $z = \{(i(t), j(t)) : 1 \le t \le u\}$, where $i(1) < \cdots < i(u)$ and
$j(1) < \cdots < j(u)$ are positive integers. We shall use the notation $z(u)$ to
signify a candidate alignment with $u$ pairs or matches. Throughout the text,
$|\cdot|$ shall denote the number of elements in a finite set and, in particular,
$|z(u)| = u$. Given sequences $\mathbf{x}_m = x_1 \cdots x_m$ and $\mathbf{y}_n = y_1 \cdots y_n$, where
$x_i, y_j \in \mathcal{A}$ for all $i$ and $j$, we define

(2.1) $S_{z(u)}(\mathbf{x}_m, \mathbf{y}_n) = \sum_{t=1}^{u} K(x_{i(t)}, y_{j(t)}) - \sum_{t=1}^{u-1} [g(i(t+1) - i(t) - 1) + g(j(t+1) - j(t) - 1)]$

if $i(u) \le m$ and $j(u) \le n$. For completeness, define $S_{z(u)}(\mathbf{x}_m, \mathbf{y}_n) = -\infty$ if
either $i(u) > m$ or $j(u) > n$. Let the local alignment score

(2.2) $H(\mathbf{x}_m, \mathbf{y}_n) = \sup_{z \in \mathcal{Z}} S_z(\mathbf{x}_m, \mathbf{y}_n)$.
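
The definitions (2.1) and (2.2) translate into a direct dynamic program over the last matched pair. The Python sketch below is a naive illustration, not one of the O(mn) algorithms cited in the Introduction: it takes O(m^2 n^2) time, and the function name and the callables `K` and `g` are assumptions made for this example.

```python
def local_alignment_score(x, y, K, g):
    """Naive evaluation of H(x, y) from (2.1)-(2.2).

    x, y : non-empty sequences of letters
    K    : callable, K(a, b) is the similarity score of letters a and b
    g    : callable, g(k) is the penalty for skipping k letters, with g(0) == 0
    """
    m, n = len(x), len(y)
    # M[i][j] = best score of an alignment whose last match is (i, j) (0-based).
    M = [[0.0] * n for _ in range(m)]
    best = float("-inf")
    for i in range(m):
        for j in range(n):
            ext = 0.0  # option: (i, j) is the first match of the alignment
            for ip in range(i):
                for jp in range(j):
                    # option: extend an alignment ending at (ip, jp),
                    # paying for the letters skipped in each sequence
                    cand = M[ip][jp] - g(i - ip - 1) - g(j - jp - 1)
                    if cand > ext:
                        ext = cand
            M[i][j] = K(x[i], y[j]) + ext
            best = max(best, M[i][j])
    return best
```

For instance, with a score of +1 for a match, -1 for a mismatch and a flat penalty of 0.5 for any gap, aligning "AC" with "AGC" gives 1 + 1 - 0.5 = 1.5, obtained by matching A with A and C with C and skipping the G.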

Under the null hypothesis of no relation between $\mathbf{x}_m$ and $\mathbf{y}_n$, we assume that
$x_1, x_2, \ldots$ and $y_1, y_2, \ldots$ are independent and identically distributed with
probability measure $\mu$ satisfying $\mu(a) > 0$ for all $a \in \mathcal{A}$. Define
$\mu(\mathbf{x}_m) = \prod_{i=1}^{m} \mu(x_i)$ and assume that

(2.3) $E[K(x_1, y_1)] < 0, \qquad K_{\max} := \max_{a,b \in \mathcal{A}} K(a,b) > 0$.

The local alignment score for gapless alignments, denoted by $H_\infty$, can be
expressed in the form (2.1)-(2.2) by setting $g(k) = \infty$ for all $k \ge 1$. Its
asymptotic behavior was studied in [7, 8]. Let $\theta^*$ be the unique positive
solution to the equation $E\exp[\theta K(x_1, y_1)] = 1$. It was shown that under
(2.3), $H_\infty(\mathbf{x}_n, \mathbf{y}_n)$ has an asymptotic Gumbel distribution and

(2.4) $H_\infty(\mathbf{x}_n, \mathbf{y}_n)/\log n \to 2/\theta^*$ a.s. as $n \to \infty$.
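
The constant θ* has no closed form in general, but it is easy to compute numerically because the moment generating function of K(x_1, y_1) is convex in θ, equals 1 at θ = 0, dips below 1 under (2.3) and then increases to infinity. The following Python sketch uses plain bisection; the function names and the dictionary representation of the letter distribution are assumptions made for this example.

```python
import math

def mgf(theta, mu, K):
    """E exp[theta * K(x1, y1)] for independent letters x1, y1 drawn from mu."""
    return sum(mu[a] * mu[b] * math.exp(theta * K(a, b)) for a in mu for b in mu)

def theta_star(mu, K, tol=1e-12):
    """Unique positive root of mgf(theta) = 1, assuming condition (2.3) holds."""
    hi = 1.0
    while mgf(hi, mu, K) <= 1.0:   # move right until the mgf exceeds 1
        hi *= 2.0
    lo = hi
    while mgf(lo, mu, K) >= 1.0:   # move left until the mgf is below 1
        lo /= 2.0
    while hi - lo > tol:           # bisection on the sign of mgf - 1
        mid = 0.5 * (lo + hi)
        if mgf(mid, mu, K) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As a check, for four equally likely letters with score +1 for a match and -1 for a mismatch, the equation reads (1/4)e^θ + (3/4)e^{-θ} = 1, whose positive root is θ* = log 3.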
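Analogous to (2.1)-(2.2), we define

(2.5) $R_{z(u)}(\mathbf{x}_m, \mathbf{y}_n) = S_{z(u)}(\mathbf{x}_m, \mathbf{y}_n) - g(i(1) - 1) - g(j(1) - 1) - g(m - i(u)) - g(n - j(u))$,

(2.6) $G(\mathbf{x}_m, \mathbf{y}_n) = \sup_{z \in \mathcal{Z}} R_z(\mathbf{x}_m, \mathbf{y}_n)$.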

$G$ is known as the global alignment score and differs from the local alignment
score $H$ in that unaligned letters both before and after the alignment $z$
are penalized. If $g$ is chosen such that $\beta := \lim_{n\to\infty} E[G(\mathbf{x}_n, \mathbf{y}_n)/n] > 0$,
then $H(\mathbf{x}_n, \mathbf{y}_n)/n \to \beta$ in probability and the gap penalty is said to lie in
the linear domain. Conversely, for $\beta < 0$, there exist $t_2 > t_1 > 0$ such that
$\lim_{n\to\infty} P\{t_2 > H(\mathbf{x}_n, \mathbf{y}_n)/\log n > t_1\} = 1$ and the gap penalty is said to lie
in the logarithmic domain (cf. [3], Lemmas 2 and 3).
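
The global score of (2.5)-(2.6) can be evaluated with the same kind of dynamic program as the local score, adding the penalties for the unaligned prefixes and suffixes. The Python sketch below is a naive O(m^2 n^2) illustration under the same assumed interface (callables `K` and `g`, hypothetical function name), not an algorithm taken from the cited references.

```python
def global_alignment_score(x, y, K, g):
    """Naive evaluation of G(x, y) from (2.5)-(2.6).

    Differs from the local score only in that letters before the first match
    and after the last match are penalized through g.
    """
    m, n = len(x), len(y)
    # M[i][j] = best prefix-penalized score of an alignment ending at (i, j).
    M = [[0.0] * n for _ in range(m)]
    best = float("-inf")
    for i in range(m):
        for j in range(n):
            # open the alignment at (i, j): pay for the i and j skipped letters
            ext = -g(i) - g(j)
            for ip in range(i):
                for jp in range(j):
                    ext = max(ext, M[ip][jp] - g(i - ip - 1) - g(j - jp - 1))
            M[i][j] = K(x[i], y[j]) + ext
            # close the alignment at (i, j): pay for the remaining letters
            best = max(best, M[i][j] - g(m - 1 - i) - g(n - 1 - j))
    return best
```

Since every boundary penalty is nonnegative, G never exceeds H on the same pair of sequences. For example, with +1/-1 scores and a flat gap penalty of 0.5, aligning "A" with "TA" gives G = 1 - 0.5 = 0.5, whereas the local score of the same pair is 1.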

In some sequence alignment software, for example, XPARAL (cf. [13]), the
user is required to specify gap penalties of the form $g(k) = \Delta + \gamma(k)$ for $k \ge 1$,
where $\gamma(k) = \delta \log k$, $\gamma(k) = \delta(k-1)^{1/2}$ or $\gamma(k) = \delta(k-1)$ for some $\delta > 0$. By
Arratia and Waterman [3], it follows that if $\gamma(k) = \delta(k-1)$, then $g$ lies in the
logarithmic domain if the gap penalty is chosen large enough. However, it is
unclear for the cases $\gamma(k) = \delta \log k$ and $\gamma(k) = \delta(k-1)^{1/2}$ whether logarithmic
growth of $H$ is possible. Note that for these choices of $\gamma$, the constant $\beta$
is always nonnegative. This can be seen by considering an alignment with
exactly one match. In Theorem 1, we provide a summation test that will
allow us to determine the types of $\gamma$ for which logarithmic growth occurs
when $\Delta$ is large. It formalizes a statement in [16], where a rough heuristic is
used to suggest that gap penalties satisfying $\sum_{k=1}^{\infty} \exp[-\theta^* g(k)] < \infty$ should
be chosen for logarithmic growth of $H$.

Theorem 1. Let $g(k) = \Delta + \gamma(k)$ for $k \ge 1$ with $\gamma(1) = 0$.

(a) If $\sum_{k=1}^{\infty} \exp[-\theta\gamma(k)] < \infty$ for some $\theta < \theta^*$, then $g$ lies in the
logarithmic domain for all large $\Delta$.

(b) If $\sum_{k=1}^{\infty} \exp[-\theta\gamma(k)] = \infty$ for some $\theta > \theta^*$, then $g$ lies in the linear
domain for all $\Delta$.

(c) Let $\gamma(k) = \delta \log k$ for some $\delta > 0$. If $\delta > (\theta^*)^{-1}$, then $g$ lies in the
logarithmic domain for all large $\Delta$. Conversely, if $\delta < (\theta^*)^{-1}$, then $g$ lies in the
linear domain for all $\Delta$.
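
To get a feel for the summation test, one can look at partial sums of the series in Theorem 1. The Python sketch below is only a numerical illustration, since a finite partial sum cannot prove convergence or divergence; the function names are hypothetical. For the logarithmic penalty the summand is k^{-θδ}, so the series converges exactly when θδ > 1, which is the dichotomy recorded in part (c).

```python
import math

def partial_sum(theta, gamma, n_terms):
    """Partial sum of exp(-theta * gamma(k)) over k = 1, ..., n_terms."""
    return sum(math.exp(-theta * gamma(k)) for k in range(1, n_terms + 1))

def log_penalty_domain(delta, theta_star):
    """Classification stated in Theorem 1(c) for gamma(k) = delta * log(k)."""
    if delta > 1.0 / theta_star:
        return "logarithmic for all large Delta"
    if delta < 1.0 / theta_star:
        return "linear for all Delta"
    return "not covered by Theorem 1(c)"
```

With θδ = 2 the partial sums settle near π²/6, while with θδ = 1/2 they grow like twice the square root of the number of terms. The boundary case δ = 1/θ* is not decided by the theorem, and the helper reports it as such.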

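Let $\mathcal{Z}_\kappa = \{z \in \mathcal{Z} : (1,1) \in z, |z| = \kappa\}$. Define

(2.7) $G_\kappa(\mathbf{x}_m, \mathbf{y}_n) = \max_{z \in \mathcal{Z}_\kappa} R_z(\mathbf{x}_m, \mathbf{y}_n)$.

For $\theta > 0$, define

(2.8) $h_\kappa(\theta) = \sum_{m,n \ge \kappa} E\exp[\theta G_\kappa(\mathbf{x}_m, \mathbf{y}_n)], \qquad \psi_\kappa(\theta) = \log h_\kappa(\theta)$.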

To prove Theorem 1, we need to consider only $\kappa = 1$, but the strong law
results of Theorems 2 and 3 use the convergence of $\psi_\kappa(\theta)/\kappa$ as $\kappa \to \infty$. We
preface the proof of Theorem 1 with Lemma 1, which uses an importance
sampling scheme to achieve a change of measure. For $\kappa = 1$ and $g(k) =
\Delta + \delta(k-1)$, a modified version of this scheme was implemented in [6] for
efficient simulation of $P\{H(\mathbf{x}_m, \mathbf{y}_n) \ge c\}$.

Lemma 1. Let $\theta_\kappa$ be a positive root of $\psi_\kappa(\theta) = 0$ (if it exists). Then

(2.9) $P\{H(\mathbf{x}_m, \mathbf{y}_n) \ge c\} \le mn \exp[\theta_\kappa(\kappa - 1)K_{\max}] \exp(-\theta_\kappa c)$.
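
The bound (2.9) is explicit once a root θ_κ is available. The short Python sketch below simply evaluates its right-hand side, capped at 1 because it bounds a probability; the function name is hypothetical and the root `theta_kappa` is assumed to be supplied by the caller (it is not computed here).

```python
import math

def lemma1_tail_bound(m, n, c, theta_kappa, kappa, K_max):
    """Right-hand side of (2.9), capped at 1.

    theta_kappa is assumed to be a positive root of psi_kappa(theta) = 0.
    """
    bound = m * n * math.exp(theta_kappa * ((kappa - 1) * K_max - c))
    return min(1.0, bound)
```

For κ = 1, m = n and c = 3 log n / θ_1, the bound reduces to n² · n⁻³ = 1/n, which is the calculation behind (2.14) in the proof of Theorem 1.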

Proof. Let us simulate $(\mathbf{x}_m, \mathbf{y}_n)$ in the following manner:

1. Initialization step. Simulate $(i_*, j_*)$ uniformly from $\{1, \ldots, m\} \times \{1, \ldots, n\}$
and let $x_i, y_j \sim \mu$ for all $i < i_*$ and $j < j_*$. Initialize the partial sum $S = 0$.

2. Repetition steps.
(a) Simulation. Simulate $(\mathbf{v}_r, \mathbf{w}_s)$ from the measure $\nu$ on the domain
$\bigcup_{r \ge \kappa} \bigcup_{s \ge \kappa} \mathcal{A}^r \times \mathcal{A}^s$ with

(2.10) $\nu(\mathbf{v}_r, \mathbf{w}_s) = \exp[\theta_\kappa G_\kappa(\mathbf{v}_r, \mathbf{w}_s)] \mu(\mathbf{v}_r) \mu(\mathbf{w}_s)$.

Note that both $r$ and $s$ are random here, taking values in $\{\kappa, \kappa+1, \ldots\}$.
Moreover, $\nu$ is a probability measure because $\psi_\kappa(\theta_\kappa) = 0$.

(b) Check that the segment generated is not too long. If $i_* + r - 1 > m$ or
$j_* + s - 1 > n$, go to step 3. Otherwise, proceed to (c).

(c) Updating. Let $x_{i_*} \cdots x_{i_*+r-1} = \mathbf{v}_r$, $y_{j_*} \cdots y_{j_*+s-1} = \mathbf{w}_s$ and let
(new) $(i_*, j_*)$ = (old) $(i_*, j_*) + (r, s)$. Let (new) $S$ = (old) $S + G_\kappa(\mathbf{v}_r, \mathbf{w}_s)$.
If $S \ge c - (\kappa - 1)K_{\max}$, go to step 3. Otherwise, repeat step 2.

3. Conclusion step. Simulate $x_i, y_j \sim \mu$ for all $i \ge i_*$ and $j \ge j_*$.

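Let $Q$ denote the probability measure of $(\mathbf{x}_m, \mathbf{y}_n)$ simulated in this manner
and let $P(\mathbf{x}_m, \mathbf{y}_n) = \mu(\mathbf{x}_m)\mu(\mathbf{y}_n)$. Equation (2.9) clearly holds when
$c \le (\kappa - 1)K_{\max}$, so we may assume without loss of generality that $c > (\kappa - 1)K_{\max}$.
Let $A = \{(\mathbf{x}_m, \mathbf{y}_n) : H(\mathbf{x}_m, \mathbf{y}_n) \ge c\}$. For all $(\mathbf{x}_m, \mathbf{y}_n) \in A$, there exists $z \in \mathcal{Z}$
such that $S_z(\mathbf{x}_m, \mathbf{y}_n) \ge c$. Since $c > (\kappa - 1)K_{\max}$, it follows that $z$ has at least
$\kappa$ matches and can be expressed in the form $z = \{(i(t), j(t)) : 1 \le t \le \lambda\kappa + q\}$
for some $\lambda \ge 1$ and $0 \le q < \kappa$. Let $\hat{z} = \{(i(t), j(t)) : 1 \le t \le \lambda\kappa\}$, which is $z$
without the last $q$ matches. Since $q \le \kappa - 1$ and $S_z(\mathbf{x}_m, \mathbf{y}_n) \ge c$, it follows
that

(2.11) $S_{\hat{z}}(\mathbf{x}_m, \mathbf{y}_n) \ge c - (\kappa - 1)K_{\max}$.

We break up $x_{i(1)} x_{i(1)+1} \cdots x_{i(\lambda\kappa)}$ into $\lambda$ segments $\mathbf{v}^{(1)}, \ldots, \mathbf{v}^{(\lambda)}$ with
$\mathbf{v}^{(1)} = x_{i(1)} \cdots x_{i(\kappa+1)-1}$, $\mathbf{v}^{(2)} = x_{i(\kappa+1)} \cdots x_{i(2\kappa+1)-1}$, ...,
$\mathbf{v}^{(\lambda-1)} = x_{i((\lambda-2)\kappa+1)} \cdots x_{i((\lambda-1)\kappa+1)-1}$
and, for the last segment, $\mathbf{v}^{(\lambda)} = x_{i((\lambda-1)\kappa+1)} \cdots x_{i(\lambda\kappa)}$. Similarly,
$y_{j(1)} y_{j(1)+1} \cdots y_{j(\lambda\kappa)}$ is broken up into $\lambda$ segments $\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(\lambda)}$, where
$\mathbf{w}^{(1)} = y_{j(1)} \cdots y_{j(\kappa+1)-1}$, $\mathbf{w}^{(2)} = y_{j(\kappa+1)} \cdots y_{j(2\kappa+1)-1}$, ... and, for the last
segment, $\mathbf{w}^{(\lambda)} = y_{j((\lambda-1)\kappa+1)} \cdots y_{j(\lambda\kappa)}$.
$(\mathbf{x}_m, \mathbf{y}_n)$ can be generated from the simulation scheme above if $(i(1), j(1))$
is simulated in step 1 [as $(i_*, j_*)$] and $(\mathbf{v}^{(\eta)}, \mathbf{w}^{(\eta)})$ is generated on the $\eta$th
iteration of step 2(a). Since $S_{\hat{z}}(\mathbf{x}_m, \mathbf{y}_n) \le \sum_{\eta=1}^{\lambda} G_\kappa(\mathbf{v}^{(\eta)}, \mathbf{w}^{(\eta)})$, it follows by
(2.10) and (2.11) that

(2.12) $\frac{dQ}{dP}(\mathbf{x}_m, \mathbf{y}_n) \ge (nm)^{-1} \prod_{\eta=1}^{\lambda} \frac{\nu(\mathbf{v}^{(\eta)}, \mathbf{w}^{(\eta)})}{\mu(\mathbf{v}^{(\eta)})\mu(\mathbf{w}^{(\eta)})} \ge (nm)^{-1} \exp[\theta_\kappa S_{\hat{z}}(\mathbf{x}_m, \mathbf{y}_n)] \ge (nm)^{-1} \exp\{\theta_\kappa[c - (\kappa - 1)K_{\max}]\}$.

This holds for all $(\mathbf{x}_m, \mathbf{y}_n) \in A$ and, hence, (2.9) follows from the identity
$P(A) = E_Q[(dP/dQ)\mathbf{1}_A]$. □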
Proof of Theorem 1. Since $\mathcal{Z}_1$ contains only the alignment $\{(1,1)\}$,
it follows that $G_1(\mathbf{x}_m, \mathbf{y}_n) = K(x_1, y_1) - g(m-1) - g(n-1)$ and, hence, by
(2.8),

(2.13) $h_1(\theta) = \Big\{1 + \sum_{k=1}^{\infty} \exp[-\theta g(k)]\Big\}^2 E\exp[\theta K(x_1, y_1)]$.

Let $\sum_{k=1}^{\infty} \exp[-\theta\gamma(k)] < \infty$ for some $0 < \theta < \theta^*$. Since $E\exp[\theta K(x_1, y_1)] < 1$,
we can find $\Delta$ large enough such that $h_1(\theta) < 1$. By (2.3) and (2.13),
$\psi_1(\theta) = \log h_1(\theta) \to \infty$ as $\theta \to \infty$ and, hence, $\psi_1(\theta_1) = 0$ for some $\theta_1 > \theta$. By
Lemma 1 with $\kappa = 1$,

(2.14) $\lim_{n\to\infty} P\{H(\mathbf{x}_n, \mathbf{y}_n) \ge 3\log n/\theta_1\} = 0$.

Since $H \ge H_\infty$, the gapless local alignment score, it follows from (2.4) that
$\lim_{n\to\infty} P\{H(\mathbf{x}_n, \mathbf{y}_n) \ge \log n/\theta^*\} = 1$ and, hence, (a) follows from (2.14).
The first part of (c) also follows from (a) by choosing $\theta \in (\delta^{-1}, \theta^*)$.

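We shall next show the second part of (c). Let $g(k) = \Delta + \delta\log k$ for some
$\delta < (\theta^*)^{-1}$. Let $\mathbf{v}^{(\eta)} = x_{r(\eta-1)+1} \cdots x_{r\eta}$ and $\mathbf{w}^{(\eta)} = y_{r(\eta-1)+1} \cdots y_{r\eta}$ for $1 \le \eta \le \lambda$,
where $\lambda$ and $r$ are positive integers to be specified later. Then
$G(\mathbf{x}_{r\lambda}, \mathbf{y}_{r\lambda}) \ge \sum_{\eta=1}^{\lambda} H_\infty(\mathbf{v}^{(\eta)}, \mathbf{w}^{(\eta)}) - 2(\lambda+1)g(2r)$ and, hence, it follows from (2.4) that
for any $\varepsilon > 0$, there exists $r$ large enough such that

(2.15) $E[G(\mathbf{x}_{r\lambda}, \mathbf{y}_{r\lambda})] \ge \lambda E[H_\infty(\mathbf{x}_r, \mathbf{y}_r)] - 2(\lambda+1)g(2r) \ge 2\lambda(1-\varepsilon)(\log r)/\theta^* - 2(\lambda+1)[\Delta + \delta\log(2r+1)]$.

Since $\delta < (\theta^*)^{-1}$, it follows from (2.15) that there exist $\varepsilon$ small enough and
$\lambda, r$ large enough such that $E[G(\mathbf{x}_{r\lambda}, \mathbf{y}_{r\lambda})] > 0$ and, hence, $g$ lies in the linear
domain.

To show (b), pick $\delta \in (\theta^{-1}, (\theta^*)^{-1})$. Since $\sum_{k=1}^{\infty} \exp[-\theta(\delta\log k)] < \infty$ and
$\sum_{k=1}^{\infty} \exp[-\theta\gamma(k)] = \infty$, it follows that $\gamma(k) < \delta\log k$ for infinitely many $k$.
Then for any $\varepsilon > 0$, (2.15) holds for infinitely many $r$ and (b) follows by
choosing $\lambda, r$ large enough and $\varepsilon$ small enough. □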

3. Large deviations and the strong law of large numbers. In this section
we extend the large deviations and strong law results of Arratia and
Waterman [3] and Zhang [20] to gap penalties satisfying $\lim_{k\to\infty} g(k)/k = 0$,
bypassing the Azuma-Hoeffding inequality that was central to their proofs
for the case $\lim_{k\to\infty} g(k)/k > 0$.

Theorem 2. Let $g(k) = \Delta + \gamma(k)$ for $k \ge 1$, where $\gamma(1) = 0$ and
$\lim_{k\to\infty} \gamma(k)/\log k = \infty$. Then $\psi(\theta) = \lim_{\kappa\to\infty} \psi_\kappa(\theta)/\kappa$ is well defined, convex and
finite for all $\theta > 0$. Moreover, for all large $\Delta$,

(3.1) $\psi(\theta) = 0$ has a unique positive root $\tilde\theta$.

Under (3.1), the following also hold.

(a) If $\min\{m, n\}/c \to \infty$ and $\log(mn) = o(c)$ as $c \to \infty$, then

(3.2) $\lim_{c\to\infty} -c^{-1}\log P\{H(\mathbf{x}_m, \mathbf{y}_n) \ge c\} = \tilde\theta$.

(b) $H(\mathbf{x}_n, \mathbf{y}_n)/\log n \to 2/\tilde\theta$ a.s. as $n \to \infty$.

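Proof. Let $\kappa, \eta$ be positive integers and consider $\mathbf{x}_m, \mathbf{y}_n$ with
$m, n \ge \kappa + \eta$. Let

(3.3) $\Pi(\mathbf{x}_m, \mathbf{y}_n) = \{(\mathbf{v}_p^{(1)}, \mathbf{v}_q^{(2)}, \mathbf{w}_r^{(1)}, \mathbf{w}_s^{(2)}) : \mathbf{x}_m = \mathbf{v}_p^{(1)}\mathbf{v}_q^{(2)},\ \mathbf{y}_n = \mathbf{w}_r^{(1)}\mathbf{w}_s^{(2)} \text{ with } p, r \ge \kappa \text{ and } q, s \ge \eta\}$.

In other words, $\mathbf{v}_p^{(1)} = x_1 \cdots x_p$, $\mathbf{v}_q^{(2)} = x_{p+1} \cdots x_m$, $\mathbf{w}_r^{(1)} = y_1 \cdots y_r$ and
$\mathbf{w}_s^{(2)} = y_{r+1} \cdots y_n$. For notational simplicity, we shall henceforth omit the superscripts
(1), (2) when describing members of $\Pi(\mathbf{x}_m, \mathbf{y}_n)$. Since $G_{\kappa+\eta}(\mathbf{x}_m, \mathbf{y}_n) = R_z(\mathbf{x}_m, \mathbf{y}_n)$
for some $z = \{(i(t), j(t)) : 1 \le t \le \kappa + \eta\} \in \mathcal{Z}_{\kappa+\eta}$, it follows by selecting
$p = i(\kappa+1) - 1$ and $r = j(\kappa+1) - 1$ that $G_{\kappa+\eta}(\mathbf{x}_m, \mathbf{y}_n) = G_\kappa(\mathbf{v}_p, \mathbf{w}_r) + G_\eta(\mathbf{v}_q, \mathbf{w}_s)$
for some $(\mathbf{v}_p, \mathbf{v}_q, \mathbf{w}_r, \mathbf{w}_s) \in \Pi(\mathbf{x}_m, \mathbf{y}_n)$. As $\mu(\mathbf{x}_m) = \mu(\mathbf{v}_p)\mu(\mathbf{v}_q)$ and
$\mu(\mathbf{y}_n) = \mu(\mathbf{w}_r)\mu(\mathbf{w}_s)$, the inequality

(3.4) $\exp[\theta G_{\kappa+\eta}(\mathbf{x}_m, \mathbf{y}_n)]\mu(\mathbf{x}_m)\mu(\mathbf{y}_n) \le \sum_{(\mathbf{v}_p, \mathbf{v}_q, \mathbf{w}_r, \mathbf{w}_s) \in \Pi(\mathbf{x}_m, \mathbf{y}_n)} \exp[\theta G_\kappa(\mathbf{v}_p, \mathbf{w}_r)]\mu(\mathbf{v}_p)\mu(\mathbf{w}_r) \times \exp[\theta G_\eta(\mathbf{v}_q, \mathbf{w}_s)]\mu(\mathbf{v}_q)\mu(\mathbf{w}_s)$

holds because there exists a term on the right-hand side of (3.4) that is equal
to the left-hand side. Noting that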

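(3.5) $\bigcup_{(\mathbf{x}_m, \mathbf{y}_n) : m, n \ge \kappa+\eta} \Pi(\mathbf{x}_m, \mathbf{y}_n) = \{(\mathbf{v}_p, \mathbf{v}_q, \mathbf{w}_r, \mathbf{w}_s) : p, r \ge \kappa \text{ and } q, s \ge \eta\}$

and that the left-hand side of (3.5) is a disjoint union, we can conclude from
(3.4) that, for all $\theta > 0$,

(3.6)
$\psi_{\kappa+\eta}(\theta) = \log\Big\{\sum_{(\mathbf{x}_m, \mathbf{y}_n) : m, n \ge \kappa+\eta} \exp[\theta G_{\kappa+\eta}(\mathbf{x}_m, \mathbf{y}_n)]\mu(\mathbf{x}_m)\mu(\mathbf{y}_n)\Big\}$
$\le \log\Big\{\sum_{(\mathbf{x}_m, \mathbf{y}_n) : m, n \ge \kappa+\eta}\ \sum_{(\mathbf{v}_p, \mathbf{v}_q, \mathbf{w}_r, \mathbf{w}_s) \in \Pi(\mathbf{x}_m, \mathbf{y}_n)} \exp[\theta G_\kappa(\mathbf{v}_p, \mathbf{w}_r)]\mu(\mathbf{v}_p)\mu(\mathbf{w}_r) \times \exp[\theta G_\eta(\mathbf{v}_q, \mathbf{w}_s)]\mu(\mathbf{v}_q)\mu(\mathbf{w}_s)\Big\}$
$= \log\Big\{\sum_{(\mathbf{v}_p, \mathbf{v}_q, \mathbf{w}_r, \mathbf{w}_s) : p, r \ge \kappa \text{ and } q, s \ge \eta} \exp[\theta G_\kappa(\mathbf{v}_p, \mathbf{w}_r)]\mu(\mathbf{v}_p)\mu(\mathbf{w}_r) \times \exp[\theta G_\eta(\mathbf{v}_q, \mathbf{w}_s)]\mu(\mathbf{v}_q)\mu(\mathbf{w}_s)\Big\}$
$= \log\Big\{\sum_{(\mathbf{v}_p, \mathbf{w}_r) : p, r \ge \kappa} \exp[\theta G_\kappa(\mathbf{v}_p, \mathbf{w}_r)]\mu(\mathbf{v}_p)\mu(\mathbf{w}_r) \times \sum_{(\mathbf{v}_q, \mathbf{w}_s) : q, s \ge \eta} \exp[\theta G_\eta(\mathbf{v}_q, \mathbf{w}_s)]\mu(\mathbf{v}_q)\mu(\mathbf{w}_s)\Big\}$
$= \psi_\kappa(\theta) + \psi_\eta(\theta)$.

Moreover, as $h_\kappa(\theta) \ge E\exp[\theta G_\kappa(\mathbf{x}_\kappa, \mathbf{y}_\kappa)] = \{E\exp[\theta K(x_1, y_1)]\}^\kappa$ and
$\psi_\kappa(\theta) = \log h_\kappa(\theta)$, it follows that

(3.7) $\psi_\kappa(\theta)/\kappa \ge \log E\exp[\theta K(x_1, y_1)] > -\infty$ for all $\theta > 0$ and $\kappa \ge 1$.

The subadditive property (3.6) then ensures that $\psi(\theta) = \lim_{\kappa\to\infty} \psi_\kappa(\theta)/\kappa$ is
well defined and finite. Since $\psi_\kappa(\theta)$ is convex for all $\kappa$ (see Section A.1), it
follows that $\psi$ is convex and continuous.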

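Pick an arbitrary positive $\theta < \theta^*$, where $\theta^*$ is the unique positive root of
the equation $E\exp[\theta K(x_1, y_1)] = 1$. By (2.13), $\psi_1(\theta) < 0$ for all large $\Delta$ and,
hence, $\psi(\theta) \le \psi_1(\theta) < 0$. By (2.3) and (3.7), $\lim_{\theta\to\infty} \psi(\theta) = \infty$ and, hence,
a positive solution $\tilde\theta$ $(> \theta)$ of the equation $\psi(\tilde\theta) = 0$ exists. To show that $\tilde\theta$
is unique, it suffices from the convexity of $\psi$ to show that $\lim_{\theta\to 0} \psi(\theta) = 0$.
Since

(3.8) $G_\kappa(\mathbf{x}_r, \mathbf{y}_s) \le \kappa K_{\max} - g(r - \kappa) - g(s - \kappa)$,

it follows that

$h_\kappa(\theta) \le \sum_{r, s \ge \kappa} \exp\{\theta[\kappa K_{\max} - g(r - \kappa) - g(s - \kappa)]\} = \exp(\theta\kappa K_{\max})\Big\{\sum_{k \ge 0} \exp[-\theta g(k)]\Big\}^2$

and, indeed, $\psi(\theta) = \lim_{\kappa\to\infty} [\log h_\kappa(\theta)]/\kappa \le \theta K_{\max} \to 0$ as $\theta \to 0$. Moreover,
by (3.7), $\psi(\theta) = \lim_{\kappa\to\infty} \psi_\kappa(\theta)/\kappa \ge \log E\exp[\theta K(x_1, y_1)] \to 0$ as $\theta \to 0$.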

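(a) It follows from (3.1) that there exists $\theta_\kappa \to \tilde\theta$ such that $\psi_\kappa(\theta_\kappa) = 0$ for
all large $\kappa$. By Lemma 1 and as $c^{-1}\log(mn) \to 0$, it follows that

(3.9) $\liminf_{c\to\infty} -c^{-1}\log P\{H(\mathbf{x}_m, \mathbf{y}_n) \ge c\} \ge \theta_\kappa$.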
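To get the opposite inequality, define

(3.10) $\xi_\kappa(\theta) = \sup_{r \ge \kappa} \log\{E\exp[\theta G_\kappa(\mathbf{x}_r, \mathbf{y}_r)]\}$.

Clearly, $G_{\kappa+\eta}(\mathbf{x}_{r+s}, \mathbf{y}_{r+s}) \ge G_\kappa(\mathbf{x}_r, \mathbf{y}_r) + G_\eta(x_{r+1} \cdots x_{r+s}, y_{r+1} \cdots y_{r+s})$ and,
hence,

$E\exp[\theta G_{\kappa+\eta}(\mathbf{x}_{r+s}, \mathbf{y}_{r+s})] \ge E\exp[\theta G_\kappa(\mathbf{x}_r, \mathbf{y}_r)]\, E\exp[\theta G_\eta(\mathbf{x}_s, \mathbf{y}_s)]$.

By taking the supremum over $r$ and $s$, the superadditive property
$\xi_{\kappa+\eta}(\theta) \ge \xi_\kappa(\theta) + \xi_\eta(\theta)$ holds. Since $\psi_\kappa(\theta) \ge \xi_\kappa(\theta)$ for all $\kappa$, it follows that
$\psi(\theta) \ge \lim_{\kappa\to\infty} \xi_\kappa(\theta)/\kappa$. It shall be shown in Section A.2 that if $g(k)/\log k \to \infty$,
then

(3.11) $\psi(\theta) = \lim_{\kappa\to\infty} \xi_\kappa(\theta)/\kappa$ for all $\theta > 0$

and, hence, there exists $\theta^*_\kappa \to \tilde\theta$ such that

(3.12) $\xi_\kappa(\theta^*_\kappa) = 0$

for all large $\kappa$. Let $\kappa$ satisfy (3.12). It follows from (3.8) that
$E\exp[\theta G_\kappa(\mathbf{x}_r, \mathbf{y}_r)] \to 0$ as $r \to \infty$ and, hence, for all $\theta > 0$, the supremum in (3.10)
is attained at some $r \ge \kappa$. By (3.12), it follows that

(3.13) $E\exp[\theta^*_\kappa G_\kappa(\mathbf{x}_r, \mathbf{y}_r)] = 1$ for some $r$ $(= r_\kappa)$.

Let $v_\eta = G_\kappa(x_{r(\eta-1)+1} \cdots x_{r\eta}, y_{r(\eta-1)+1} \cdots y_{r\eta})$ and let $Q$ be the measure
under which $v_1, v_2, \ldots$ are independent with $Q\{v_\eta = k\} = \exp(\theta^*_\kappa k)P\{v_\eta = k\}$.
By (3.13), $Q$ is a probability measure. Let $T_c = \inf\{\ell : \sum_{\eta=1}^{\ell} v_\eta \ge c\}$. Then
$(dQ/dP)(v_1, \ldots, v_{T_c}) = \exp(\theta^*_\kappa \sum_{\eta=1}^{T_c} v_\eta) \le \exp[\theta^*_\kappa(c + \kappa K_{\max})]$ whenever $T_c < \infty$.
Hence,

(3.14) $P\{T_c \le \lambda\} \ge \exp[-\theta^*_\kappa(c + \kappa K_{\max})]\, Q\{T_c \le \lambda\}$

for any positive integer $\lambda$. Pick $\lambda = \lfloor \min\{m, n\}/r \rfloor$, where $\lfloor \cdot \rfloor$ denotes the
greatest integer function. Since $\min\{m, n\}/c \to \infty$, it follows that $\lambda/c \to \infty$.
By the law of large numbers and as $E_Q v_1 > 0$, we can conclude that
$Q\{T_c \le \lambda\} \to 1$. By (3.14) and as $H(\mathbf{x}_m, \mathbf{y}_n) \ge \sum_{\eta=1}^{\ell} v_\eta$ for all $\ell \le \lambda$, so that
$\{H(\mathbf{x}_m, \mathbf{y}_n) \ge c\} \supset \{T_c \le \lambda\}$, it follows that

(3.15) $\limsup_{c\to\infty} -c^{-1}\log P\{H(\mathbf{x}_m, \mathbf{y}_n) \ge c\} \le \theta^*_\kappa$.

(a) then follows from (3.9) and (3.15) by letting $\kappa \to \infty$.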

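(b) Let $\varepsilon > 0$ and select $\kappa$ large enough such that $\psi_\kappa(\theta) = 0$ has a positive
solution $\theta_\kappa$. Select a subsequence $n_k = \lfloor k^{2/\varepsilon} \rfloor + 1$. Then by Lemma 1,

$\sum_{k=1}^{\infty} P\{H(\mathbf{x}_{n_k}, \mathbf{y}_{n_k}) \ge (2 + \varepsilon)(\log n_k)/\theta_\kappa\} \le \exp[\theta_\kappa(\kappa - 1)K_{\max}] \sum_{k=1}^{\infty} n_k^{-\varepsilon} < \infty$.

By the Borel-Cantelli lemma, it follows that
$\limsup_{k\to\infty} H(\mathbf{x}_{n_k}, \mathbf{y}_{n_k})/\log n_k \le (2 + \varepsilon)/\theta_\kappa$ a.s. Since $\log n_{k+1}/\log n_k \to 1$ and
$H(\mathbf{x}_n, \mathbf{y}_n)$ is nondecreasing in $n$, it follows by choosing $\varepsilon$ arbitrarily small that

(3.16) $\limsup_{n\to\infty} H(\mathbf{x}_n, \mathbf{y}_n)/\log n \le 2/\theta_\kappa$ a.s.

Let $\kappa$ satisfy (3.12) and define a score matrix $\widetilde{K}$ on $\mathcal{B} := \mathcal{A}^r$ $(= \mathcal{A}^{r_\kappa})$ [see (3.13)]
by setting $\widetilde{K}(\mathbf{x}_r, \mathbf{y}_r) = G_\kappa(\mathbf{x}_r, \mathbf{y}_r)$. Let $\mathbf{x}^{(\eta)} = x_{(\eta-1)r+1} \cdots x_{\eta r}$ and
$\mathbf{y}^{(\eta)} = y_{(\eta-1)r+1} \cdots y_{\eta r}$ for all $1 \le \eta \le \lambda := \lfloor n/r \rfloor$. Let
$\widetilde{H}_\infty(\mathbf{x}_{r\lambda}, \mathbf{y}_{r\lambda}) = H_\infty(\mathbf{x}^{(1)} \cdots \mathbf{x}^{(\lambda)}, \mathbf{y}^{(1)} \cdots \mathbf{y}^{(\lambda)})$
be the gapless local alignment score which treats $\mathbf{x}^{(\eta)}, \mathbf{y}^{(\eta)}$ as letters of $\mathcal{B}$ and
uses $\widetilde{K}$ as the score matrix. By (3.13), it follows that (2.4) holds with $\widetilde{H}_\infty$
in place of $H_\infty$ and $\theta^*_\kappa$ in place of $\theta^*$. Hence,

(3.17) $\liminf_{n\to\infty} H(\mathbf{x}_n, \mathbf{y}_n)/\log n \ge \lim_{\lambda\to\infty} \widetilde{H}_\infty(\mathbf{x}_{r\lambda}, \mathbf{y}_{r\lambda})/\log\lambda = 2/\theta^*_\kappa$ a.s.

(b) follows from (3.16) and (3.17) by letting $\kappa \to \infty$. □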

4. Asymptotic number of matches in the optimal local alignment. For
given sequences $\mathbf{x}_n, \mathbf{y}_n$, let $z$ be a candidate alignment satisfying

(4.1) $S_z(\mathbf{x}_n, \mathbf{y}_n) = H(\mathbf{x}_n, \mathbf{y}_n)$.

The alignment $z$ is not unique in general, but to be specific, we shall assume
that there exists an ordering of the candidate alignments in $\mathcal{Z}$ and only the
smallest alignment $z$ with respect to this ordering that satisfies (4.1) shall
be designated as the optimal local alignment and denoted by $z^*$. Properties
of the optimal local alignment are less stable than the local alignment score
because a slight perturbation of the sequences, for example, changing one of
the letters $x_i$ or $y_j$, can result in a very different optimal local alignment.

In this section our objective is to study $|z^*|$, the number of matches in the
optimal alignment $z^*$. We shall show in Theorem 3 that under the assumptions
of Theorem 2, $|z^*| \sim 2\log n/[\tilde\theta\psi'(\tilde\theta)]$ as $n \to \infty$ whenever the derivative
$\psi'(\tilde\theta)$ exists. Since $H(\mathbf{x}_n, \mathbf{y}_n) \sim 2\log n/\tilde\theta$ by Theorem 2(b), this gives rise to
the interpretation of $\psi'(\tilde\theta)$ as the asymptotic score per match of the optimal
alignment. The convexity of $\psi$ ensures that $\psi'(\theta)$ exists with the exception
of countably many $\theta$. A more detailed discussion of the existence of $\psi'(\tilde\theta)$,
involving measure theoretic issues, is given in Section A.3.

Theorem 3. Let $\lim_{k\to\infty} g(k)/\log k = \infty$ and assume (3.1) holds. If
$\psi'(\tilde\theta)$ is well defined, then $|z^*|/\log n \to 2/[\tilde\theta\psi'(\tilde\theta)]$ a.s.

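Proof. Let $K_\lambda$ be a score matrix satisfying $K_\lambda(a,b) = K(a,b) + \lambda$ for
all $a, b \in \mathcal{A}$. A superscript $(\lambda)$ in any notation defined previously will now be
used to signify that the score matrix $K_\lambda$ is used. If no superscript is used,
it is understood that $\lambda = 0$. Since $G_\kappa^{(\lambda)}(\mathbf{x}_m, \mathbf{y}_n) = G_\kappa(\mathbf{x}_m, \mathbf{y}_n) + \lambda\kappa$ for all
$(\mathbf{x}_m, \mathbf{y}_n)$, it follows from (2.8) that $\psi_\kappa^{(\lambda)}(\theta) = \psi_\kappa(\theta) + \lambda\kappa\theta$ and, hence,

(4.2) $\psi^{(\lambda)}(\theta) = \lim_{\kappa\to\infty} \psi_\kappa^{(\lambda)}(\theta)/\kappa = \psi(\theta) + \lambda\theta$.

By (4.2), $\psi(\tilde\theta^{(\lambda)}) + \lambda\tilde\theta^{(\lambda)} = \psi^{(\lambda)}(\tilde\theta^{(\lambda)}) = 0 = \psi(\tilde\theta)$. Since $\tilde\theta^{(\lambda)} \to \tilde\theta$ as $\lambda \to 0$, it
follows that $\psi'(\tilde\theta) = [\psi(\tilde\theta^{(\lambda)}) - \psi(\tilde\theta)]/[\tilde\theta^{(\lambda)} - \tilde\theta] + o(1) = -(1 + o(1))\lambda\tilde\theta/[\tilde\theta^{(\lambda)} - \tilde\theta]$
and, hence,

(4.3) $\tilde\theta - \tilde\theta^{(\lambda)} = (1 + o(1))\lambda\tilde\theta/\psi'(\tilde\theta)$.

Since $H^{(\lambda)}(\mathbf{x}_n, \mathbf{y}_n) \ge S^{(\lambda)}_{z^*}(\mathbf{x}_n, \mathbf{y}_n) = H(\mathbf{x}_n, \mathbf{y}_n) + \lambda|z^*|$ for all $\lambda$, it follows by
applying Theorem 2(b) to both $H(\mathbf{x}_n, \mathbf{y}_n)$ and $H^{(\lambda)}(\mathbf{x}_n, \mathbf{y}_n)$ that

(4.4)
$\limsup_{n\to\infty} |z^*|/\log n \le [(2/\tilde\theta^{(\lambda)}) - (2/\tilde\theta)]/\lambda$ a.s. if $\lambda > 0$,
$\liminf_{n\to\infty} |z^*|/\log n \ge [(2/\tilde\theta^{(\lambda)}) - (2/\tilde\theta)]/\lambda$ a.s. if $\lambda < 0$,

and Theorem 3 follows from (4.3) by letting $\lambda \to 0$ in (4.4). □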

APPENDIX 

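A.1. On the convexity of $\psi_\kappa$. Let $\theta > 0$. We can express
$\psi_\kappa(\theta) = \log[\sum_k a_k \exp(b_k\theta)]$ with $a_k > 0$ and $b_k$ distinct. Let $a(\theta) = \sum_k a_k \exp(b_k\theta)$. Then
$\psi_\kappa'(\theta) = a'(\theta)/a(\theta)$ and $\psi_\kappa''(\theta) = [a''(\theta)/a(\theta)] - [a'(\theta)/a(\theta)]^2$. Let $Z$ be a
discrete random variable such that $P(Z = b_k) = a_k \exp(b_k\theta)/a(\theta)$. Then
$\psi_\kappa''(\theta) = EZ^2 - (EZ)^2 = \mathrm{Var}(Z) \ge 0$.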
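
The identity between the second derivative and a variance can be checked numerically on a small example. The Python sketch below uses arbitrary illustrative coefficients a_k and exponents b_k (they are not derived from any score matrix) and hypothetical function names; it compares a central second difference of the log-sum with the variance of the exponentially tilted variable Z.

```python
import math

def psi(theta, a, b):
    """log of sum_k a_k * exp(b_k * theta)."""
    return math.log(sum(ak * math.exp(bk * theta) for ak, bk in zip(a, b)))

def tilted_variance(theta, a, b):
    """Var(Z) where P(Z = b_k) is proportional to a_k * exp(b_k * theta)."""
    weights = [ak * math.exp(bk * theta) for ak, bk in zip(a, b)]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(p * bk for p, bk in zip(probs, b))
    return sum(p * (bk - mean) ** 2 for p, bk in zip(probs, b))

def second_difference(theta, a, b, h=1e-4):
    """Central finite-difference approximation of the second derivative of psi."""
    return (psi(theta + h, a, b) - 2.0 * psi(theta, a, b) + psi(theta - h, a, b)) / h ** 2
```

The two quantities agree up to the discretization error of the finite difference, and the variance is nonnegative by construction, which is the convexity claim.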

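A.2. Proof of (3.11). By the arguments just before (3.11), it suffices to
show that

(A1) $\psi(\theta) \le \lim_{\kappa\to\infty} \xi_{2\kappa}(\theta)/2\kappa$.

Let $f_{r,s}^{(\kappa)}(\theta) = E\exp[\theta G_\kappa(\mathbf{x}_r, \mathbf{y}_s)]$ so that $h_\kappa(\theta) = \sum_{r,s \ge \kappa} f_{r,s}^{(\kappa)}(\theta)$ and let
$\zeta_\kappa(\theta) = \sup_{r,s \ge \kappa} f_{r,s}^{(\kappa)}(\theta)$. Let $\varepsilon > 1$ and $\theta > 0$. By (3.8), it follows that

(A2)
$h_\kappa(\theta) = \sum_{r,s \le \varepsilon^\kappa + \kappa} f_{r,s}^{(\kappa)}(\theta) + 2\sum_{r > \varepsilon^\kappa + \kappa \text{ and } s \le \varepsilon^\kappa + \kappa} f_{r,s}^{(\kappa)}(\theta) + \sum_{r,s > \varepsilon^\kappa + \kappa} f_{r,s}^{(\kappa)}(\theta)$
$\le (\varepsilon^\kappa + 1)^2 \zeta_\kappa(\theta) + 2(\varepsilon^\kappa + 1)\exp(\theta\kappa K_{\max}) \sum_{k > \varepsilon^\kappa} e^{-\theta g(k)} + \exp(\theta\kappa K_{\max})\Big(\sum_{k > \varepsilon^\kappa} e^{-\theta g(k)}\Big)^2$.

Since $g(k)/\log k \to \infty$, it follows that for any $\lambda > 1$, $g(k) \ge (\lambda\log k)/\theta$ for all
$k \ge \varepsilon^\kappa$ when $\kappa$ is large and

(A3) $\sum_{k > \varepsilon^\kappa} e^{-\theta g(k)} \le \sum_{k > \varepsilon^\kappa} k^{-\lambda} \le \int_{\varepsilon^\kappa/2}^{\infty} x^{-\lambda}\,dx = (\lambda - 1)^{-1}(\varepsilon^\kappa/2)^{-\lambda+1}$.

Let $K_{\min} = \min_{a,b \in \mathcal{A}} K(a,b)$. Since $\zeta_\kappa(\theta) \ge f_{\kappa,\kappa}^{(\kappa)}(\theta) \ge \exp(\theta\kappa K_{\min})$, it follows
by choosing $\lambda > \theta(K_{\max} - K_{\min})/\log\varepsilon$ that the second and third terms on
the right-hand side of (A2) are dominated by the first term as $\kappa \to \infty$.
Moreover, as

(A4) $G_\kappa(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}) + G_\kappa(\mathbf{x}^{(2)}, \mathbf{y}^{(2)}) \le G_{2\kappa}(\mathbf{x}^{(1)}\mathbf{x}^{(2)}, \mathbf{y}^{(1)}\mathbf{y}^{(2)})$,

it follows that $f_{r,s}^{(\kappa)}(\theta) f_{s,r}^{(\kappa)}(\theta) \le f_{r+s,r+s}^{(2\kappa)}(\theta)$ for all $r, s \ge \kappa$ and, hence, by
taking the supremum over $r$ and $s$, we can conclude that $[\zeta_\kappa(\theta)]^2 \le \exp[\xi_{2\kappa}(\theta)]$.
Hence, by (A2), (A3) and the arguments above,

(A5) $\psi(\theta) = \lim_{\kappa\to\infty} [\log h_\kappa(\theta)]/\kappa \le 2\log\varepsilon + \lim_{\kappa\to\infty} \xi_{2\kappa}(\theta)/2\kappa$.

(A1) follows by letting $\varepsilon \to 1$ in (A5).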

A.3. On the existence of $\psi'(\tilde\theta)$. Fix a gap penalty $g$ such that $g(k)/\log k \to \infty$
and let $\mathcal{K}$ denote the space of all symmetric matrices on $\mathcal{A} \times \mathcal{A}$ such that
$K_{\max} > 0$ and $\psi(\theta) = 0$ has a unique positive solution $\tilde\theta$. Induce a measure
on $\mathcal{K}$ via the Lebesgue measure on the upper triangular entries of $K$. Let
$\mathcal{L} = \{K \in \mathcal{K} : \psi'(\tilde\theta) \text{ does not exist}\}$. We shall now show that $\mathcal{L}$ has measure
zero. Consider the equivalence relation $K_1 \sim K_2$ if there exists $\lambda \in \mathbb{R}$ such
that

(A6) $K_1(a,b) = K_2(a,b) + \lambda$ for all $a, b \in \mathcal{A}$.

Let the superscript $(K)$ be used to signify the score matrix used. If (A6) holds,
then $\psi^{(K_1)}(\theta) = \psi^{(K_2)}(\theta) + \lambda\theta$ [see line before (4.2)] and, hence, $\psi^{(K_1)}$ has a
well-defined derivative at $\theta$ if and only if $\psi^{(K_2)}$ has a well-defined derivative
at $\theta$. By the convexity of $\psi$, there are countably many members in each
equivalence class such that $\psi'(\tilde\theta)$ is not well defined. If $\mathcal{L}$ is measurable, then
a direct application of Fubini's theorem would show that $\mathcal{L}$ has measure 0.
To show that $\mathcal{L}$ is measurable, define the distance measure
$\|K - K^*\| = \max_{a,b \in \mathcal{A}} |K(a,b) - K^*(a,b)|$. Then by the convexity of $\psi^{(K)}$,

(A7) $\mathcal{L} = \bigcup_{\delta > 0} \bigcap_{\varepsilon > 0} \{K \in \mathcal{K} : \varepsilon^{-1}[\psi^{(K)}(\tilde\theta^{(K)} + \varepsilon) + \psi^{(K)}(\tilde\theta^{(K)} - \varepsilon)] > \delta\}$,

where $\delta, \varepsilon$ vary over the positive rational numbers. Since
$|G_\kappa^{(K)}(\mathbf{x}_m, \mathbf{y}_n) - G_\kappa^{(K^*)}(\mathbf{x}_m, \mathbf{y}_n)| \le \kappa\|K - K^*\|$ for all $\kappa$, it follows that

(A8) $|\psi^{(K)}(\theta) - \psi^{(K^*)}(\theta)| \le \theta\|K - K^*\|$.

By (A8), both $\psi^{(K)}$ and $\tilde\theta^{(K)}$ are continuous with respect to $K$ and, hence,
$\psi^{(K)}(\tilde\theta^{(K)} + \varepsilon)$ and $\psi^{(K)}(\tilde\theta^{(K)} - \varepsilon)$ are also continuous with respect to $K$. The
sets defined in (A7) are open and $\mathcal{L}$ is measurable.